Intro to Pandas DataFrames

DataFrames are a lovely way to store data. They're essentially matrices that can store almost any type of data and are a great option for handling data where you want to keep track of rows and columns with labels. The Pandas library also comes with a handy host of functions that allow you to work with your DataFrames in very smart ways (eg. means, split-apply-combine).

The Pandas website has a lot of great documentation to help you get started. However, even after reading the official tutorials, I was stumped on some problems using DataFrames. This lesson is based directly off of how I solved those problems, so hopefully some will find it helpful.

Creating DataFrames

You can instantiate DataFrames without any data or with data from any number of sources



In [1]:

    
import pandas as pd
import numpy as np

blank = pd.DataFrame()

blank









    Out[1]:



In [2]:

    
movies = pd.DataFrame(np.zeros((4,4)), index=['Forrest Gump', 'Scanners', '2010: Odyssey Two', 'Fern Gully'], columns = ['Date Released', 'Box Office Gross', 'IMDB Score', 'Tomatometer'])

movies









    Out[2]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
    
  
  
    
      Forrest Gump
       0
       0
       0
       0
    
    
      Scanners
       0
       0
       0
       0
    
    
      2010: Odyssey Two
       0
       0
       0
       0
    
    
      Fern Gully
       0
       0
       0
       0



In [3]:

    
zeros = pd.DataFrame(np.zeros((3,5)))

zeros

Got a csv file you want to read? No problem!



In [4]:

    
probedata = pd.read_csv('12probe20cm.csv')

probedata[1:4]









    Out[4]:






  
    
      
      Animal
      Date
      Group
      Time to platform
      Trial duration
      Distance travelled (cm)
      Average speed
      % time near walls
      Direction
      Average Proximity
      Cumulative Proximity
      Whishaw's corridor
      Platform Quadrant
      Zone 1 %
      Target Zone %
      Zone 3 %
      Zone 4 %
    
  
  
    
      1
       m2
       2/1/2015
       GFP
       8.235
       59.898
       1298.611
       21.680
       2.516
       52.129
       47.077
       1232.864
       37.736
       4
        4.822
       50.943
       15.514
       0.629
    
    
      2
       m3
       2/1/2015
       GFP
       0.000
       59.944
       1562.190
       26.061
       2.096
       -0.502
       36.535
       1799.688
       54.717
       4
        5.451
       34.382
       10.273
       3.774
    
    
      3
       m4
       2/1/2015
       APP
       7.102
       59.896
       1157.948
       19.333
       0.000
       51.561
       41.952
       1392.255
       39.203
       4
       11.530
       37.945
       16.352
       0.210

Slicing DataFrames

You can slice DataFrames just like an array, and can also include some more advanced criteria



In [5]:

    
movies['Date Released'] = ['1994','1981','1984','1992']
movies['Box Office Gross'] = [50,12,8,16]
movies['IMDB Score'] = [8.8,6.8,6.8,6.4]
movies['Tomatometer'] = [.72,.80,.66,.71]

movies[1:2]

#OR

movies.iloc[1:2]









    Out[5]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
    
  
  
    
      Scanners
       1981
       12
       6.8
       0.8



In [6]:

    
movies[['IMDB Score','Tomatometer']]









    Out[6]:






  
    
      
      IMDB Score
      Tomatometer
    
  
  
    
      Forrest Gump
       8.8
       0.72
    
    
      Scanners
       6.8
       0.80
    
    
      2010: Odyssey Two
       6.8
       0.66
    
    
      Fern Gully
       6.4
       0.71



In [7]:

    
movies[movies['IMDB Score']<(movies['Tomatometer']*10)]









    Out[7]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
    
  
  
    
      Scanners
       1981
       12
       6.8
       0.80
    
    
      Fern Gully
       1992
       16
       6.4
       0.71

Adding new elements

Columns and rows can be added easily (relatively)

Columns can be added by setting a new column to another DataFrame. Just make sure that the indices are compatible!



In [8]:

    
favlist = pd.DataFrame([True,True,True,True],index=movies.index)

movies['Childhood Top 10?'] = favlist

movies









    Out[8]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
      Childhood Top 10?
    
  
  
    
      Forrest Gump
       1994
       50
       8.8
       0.72
       True
    
    
      Scanners
       1981
       12
       6.8
       0.80
       True
    
    
      2010: Odyssey Two
       1984
        8
       6.8
       0.66
       True
    
    
      Fern Gully
       1992
       16
       6.4
       0.71
       True

Rows can be added by using the append function on a DataFrame, taking an appropriately index and columned DataFrame that will be added on the end of the first DataFrame.



In [9]:

    
wildwildwest = pd.DataFrame(index=['Wild Wild West'], columns=movies.columns)

wildwildwest.iloc[0] = ['1999', 2, 4.8,.17,False]

movies.append(wildwildwest)









    Out[9]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
      Childhood Top 10?
    
  
  
    
      Forrest Gump
       1994
       50
       8.8
       0.72
        True
    
    
      Scanners
       1981
       12
       6.8
        0.8
        True
    
    
      2010: Odyssey Two
       1984
        8
       6.8
       0.66
        True
    
    
      Fern Gully
       1992
       16
       6.4
       0.71
        True
    
    
      Wild Wild West
       1999
        2
       4.8
       0.17
       False

The concat function is also very useful. It can handle DataFrames with different indices and/or columns. There are multiple ways to joining the indices and columns however you would like



In [11]:

    
pd.concat([movies,movies])









    Out[11]:






  
    
      
      Date Released
      Box Office Gross
      IMDB Score
      Tomatometer
      Childhood Top 10?
    
  
  
    
      Forrest Gump
       1994
       50
       8.8
       0.72
       True
    
    
      Scanners
       1981
       12
       6.8
       0.80
       True
    
    
      2010: Odyssey Two
       1984
        8
       6.8
       0.66
       True
    
    
      Fern Gully
       1992
       16
       6.4
       0.71
       True
    
    
      Forrest Gump
       1994
       50
       8.8
       0.72
       True
    
    
      Scanners
       1981
       12
       6.8
       0.80
       True
    
    
      2010: Odyssey Two
       1984
        8
       6.8
       0.66
       True
    
    
      Fern Gully
       1992
       16
       6.4
       0.71
       True

The Amazing GroupBy

Pandas.GroupBy is a great function that allows you to process your data in many different ways without having to get fancy or write any loops.

Essentially, it carries out three different steps:

Splitting the data into different groups [eg. treatment condition]

Applying some function to your data [eg. mean]

Combining the results back into another DataFrame

In my work, this function was extremely useful when I wanted to obtain the mean time spent in target zone for each of my treatment groups



In [12]:

    
probedata









    Out[12]:






  
    
      
      Animal
      Date
      Group
      Time to platform
      Trial duration
      Distance travelled (cm)
      Average speed
      % time near walls
      Direction
      Average Proximity
      Cumulative Proximity
      Whishaw's corridor
      Platform Quadrant
      Zone 1 %
      Target Zone %
      Zone 3 %
      Zone 4 %
    
  
  
    
      0
       m1
       2/1/2015
       APP
       10.724
       59.988
       1482.678
       24.716
       6.918
      -117.097
       45.391
       2064.426
       46.541
       4
        2.935
       17.820
       31.447
        8.805
    
    
      1
       m2
       2/1/2015
       GFP
        8.235
       59.898
       1298.611
       21.680
       2.516
        52.129
       47.077
       1232.864
       37.736
       4
        4.822
       50.943
       15.514
        0.629
    
    
      2
       m3
       2/1/2015
       GFP
        0.000
       59.944
       1562.190
       26.061
       2.096
        -0.502
       36.535
       1799.688
       54.717
       4
        5.451
       34.382
       10.273
        3.774
    
    
      3
       m4
       2/1/2015
       APP
        7.102
       59.896
       1157.948
       19.333
       0.000
        51.561
       41.952
       1392.255
       39.203
       4
       11.530
       37.945
       16.352
        0.210
    
    
      4
       m5
       2/1/2015
       GFP
       14.931
       59.912
       1310.405
       21.872
       0.000
       -21.198
       70.274
       1550.027
       49.476
       4
        6.709
       35.430
       18.658
        5.451
    
    
      5
       m6
       2/1/2015
       APP
       17.235
       59.948
       1348.853
       22.500
       0.419
      -131.100
       29.144
       1747.719
       61.845
       4
        6.080
       35.220
       14.046
        8.386
    
    
      6
       m7
       2/1/2015
       GFP
        5.086
       59.897
       1402.622
       23.417
       0.839
        30.670
       44.798
       1832.923
       45.073
       4
        7.128
       35.849
       18.658
        5.031
    
    
      7
       m8
       2/1/2015
       APP
        3.899
       59.969
       1686.095
       28.116
       6.289
       115.868
       49.414
       2482.494
       62.474
       4
        3.774
       15.094
       27.254
       13.627



In [13]:

    
grouped = probedata.groupby(['Group']).mean()

grouped









    Out[13]:






  
    
      
      Time to platform
      Trial duration
      Distance travelled (cm)
      Average speed
      % time near walls
      Direction
      Average Proximity
      Cumulative Proximity
      Whishaw's corridor
      Platform Quadrant
      Zone 1 %
      Target Zone %
      Zone 3 %
      Zone 4 %
    
    
      Group
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      APP
       9.740
       59.95025
       1418.8935
       23.66625
       3.40650
      -20.19200
       41.47525
       1921.7235
       52.51575
       4
       6.07975
       26.51975
       22.27475
       7.75700
    
    
      GFP
       7.063
       59.91275
       1393.4570
       23.25750
       1.36275
       15.27475
       49.67100
       1603.8755
       46.75050
       4
       6.02750
       39.15100
       15.77575
       3.72125



In [14]:

    
import matplotlib.pylab as plt
    import seaborn as sns
    
    sns.set_context("talk")
    grouped[['Zone 1 %','Target Zone %','Zone 3 %','Zone 4 %']].plot(kind='bar')
    plt.title('Time spent in Zone')
    plt.ylabel('Time (sec)')
    plt.xlabel('Zone')
    locs, labels = plt.xticks()
    plt.setp(labels,rotation=0)
    plt.show()



In [ ]:

	Date Released	Box Office Gross	IMDB Score	Tomatometer
Forrest Gump	0	0	0	0
Scanners	0	0	0	0
2010: Odyssey Two	0	0	0	0
Fern Gully	0	0	0	0

	Animal	Date	Group	Time to platform	Trial duration	Distance travelled (cm)	Average speed	% time near walls	Direction	Average Proximity	Cumulative Proximity	Whishaw's corridor	Platform Quadrant	Zone 1 %	Target Zone %	Zone 3 %	Zone 4 %
1	m2	2/1/2015	GFP	8.235	59.898	1298.611	21.680	2.516	52.129	47.077	1232.864	37.736	4	4.822	50.943	15.514	0.629
2	m3	2/1/2015	GFP	0.000	59.944	1562.190	26.061	2.096	-0.502	36.535	1799.688	54.717	4	5.451	34.382	10.273	3.774
3	m4	2/1/2015	APP	7.102	59.896	1157.948	19.333	0.000	51.561	41.952	1392.255	39.203	4	11.530	37.945	16.352	0.210

	IMDB Score	Tomatometer
Forrest Gump	8.8	0.72
Scanners	6.8	0.80
2010: Odyssey Two	6.8	0.66
Fern Gully	6.4	0.71

	Date Released	Box Office Gross	IMDB Score	Tomatometer	Childhood Top 10?
Forrest Gump	1994	50	8.8	0.72	True
Scanners	1981	12	6.8	0.80	True
2010: Odyssey Two	1984	8	6.8	0.66	True
Fern Gully	1992	16	6.4	0.71	True

	Animal	Date	Group	Time to platform	Trial duration	Distance travelled (cm)	Average speed	% time near walls	Direction	Average Proximity	Cumulative Proximity	Whishaw's corridor	Platform Quadrant	Zone 1 %	Target Zone %	Zone 3 %	Zone 4 %
0	m1	2/1/2015	APP	10.724	59.988	1482.678	24.716	6.918	-117.097	45.391	2064.426	46.541	4	2.935	17.820	31.447	8.805
1	m2	2/1/2015	GFP	8.235	59.898	1298.611	21.680	2.516	52.129	47.077	1232.864	37.736	4	4.822	50.943	15.514	0.629
2	m3	2/1/2015	GFP	0.000	59.944	1562.190	26.061	2.096	-0.502	36.535	1799.688	54.717	4	5.451	34.382	10.273	3.774
3	m4	2/1/2015	APP	7.102	59.896	1157.948	19.333	0.000	51.561	41.952	1392.255	39.203	4	11.530	37.945	16.352	0.210
4	m5	2/1/2015	GFP	14.931	59.912	1310.405	21.872	0.000	-21.198	70.274	1550.027	49.476	4	6.709	35.430	18.658	5.451
5	m6	2/1/2015	APP	17.235	59.948	1348.853	22.500	0.419	-131.100	29.144	1747.719	61.845	4	6.080	35.220	14.046	8.386
6	m7	2/1/2015	GFP	5.086	59.897	1402.622	23.417	0.839	30.670	44.798	1832.923	45.073	4	7.128	35.849	18.658	5.031
7	m8	2/1/2015	APP	3.899	59.969	1686.095	28.116	6.289	115.868	49.414	2482.494	62.474	4	3.774	15.094	27.254	13.627

	Time to platform	Trial duration	Distance travelled (cm)	Average speed	% time near walls	Direction	Average Proximity	Cumulative Proximity	Whishaw's corridor	Platform Quadrant	Zone 1 %	Target Zone %	Zone 3 %	Zone 4 %
Group
APP	9.740	59.95025	1418.8935	23.66625	3.40650	-20.19200	41.47525	1921.7235	52.51575	4	6.07975	26.51975	22.27475	7.75700
GFP	7.063	59.91275	1393.4570	23.25750	1.36275	15.27475	49.67100	1603.8755	46.75050	4	6.02750	39.15100	15.77575	3.72125